
BioData Mining

Springer Science and Business Media LLC

Preprints posted in the last 90 days, ranked by how well they match BioData Mining's content profile, based on 15 papers previously published here. The average preprint has a 0.03% match score for this journal, so anything above that is already an above-average fit.

1
Identifying genes associated with phenotypes using machine and deep learning

Muneeb, M.; Ascher, D.

2026-03-07 bioinformatics 10.64898/2026.03.05.709665 medRxiv
Top 0.1%
8.2%

Identifying disease-associated genes enables the development of precision medicine and the understanding of biological processes. Genome-wide association studies (GWAS), gene expression data, biological pathway analysis, and protein network analysis are among the techniques used to identify causal genes. We propose a machine-learning (ML) and deep-learning (DL) pipeline to identify genes associated with a phenotype. The proposed pipeline consists of two interrelated processes. The first is classifying people into case/control based on the genotype data. The second is calculating feature importance to identify genes associated with a particular phenotype. We considered 30 phenotypes from the openSNP data for analysis, 21 ML algorithms, and 80 DL algorithms and variants. The best-performing ML and DL models, evaluated by the area under the curve (AUC), F1 score, and Matthews correlation coefficient (MCC), were used to identify important single-nucleotide polymorphisms (SNPs), and the identified SNPs were compared with the phenotype-associated SNPs from the GWAS Catalog. The mean per-phenotype gene identification ratio (GIR) was 0.84. These results suggest that SNPs selected by ML/DL algorithms that maximize classification performance can help prioritise phenotype-associated SNPs and genes, potentially supporting downstream studies aimed at understanding disease mechanisms and identifying candidate therapeutic targets.
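The evaluation metrics named in this abstract (F1 score and Matthews correlation coefficient) are standard confusion-matrix statistics. A minimal pure-Python sketch of how they are computed (illustrative only, not the authors' pipeline):

```python
import math

def f1_and_mcc(tp, fp, fn, tn):
    """F1 score and Matthews correlation coefficient from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    # MCC balances all four cells, so it stays informative on imbalanced classes.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom
    return f1, mcc

# Example: 8 true positives, 2 false positives, 1 false negative, 9 true negatives
f1, mcc = f1_and_mcc(tp=8, fp=2, fn=1, tn=9)
```

MCC is often preferred over F1 for case/control genotype data because it also rewards correct classification of the majority (control) class.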

2
Improving Causal Gene Identification Using Large Language Models

Ofer, D.; Kaufman, H.

2026-03-10 bioinformatics 10.64898/2026.03.08.710344 medRxiv
Top 0.1%
6.0%

Genome-Wide Association Studies (GWAS) have successfully identified numerous loci associated with complex traits and diseases, yet pinpointing causal genes remains a significant challenge. The reliance on simple proximity-based heuristics is often insufficient due to linkage disequilibrium, gene interactions, and regulatory effects. Recent advancements in Large Language Models (LLMs) have demonstrated potential in automating causal gene identification, but their effectiveness remains limited by knowledge representation and retrieval mechanisms. This study builds on previous research by evaluating LLMs for causal gene identification, with a focus on enhancing performance through Retrieval-Augmented Generation (RAG) and the incorporation of genomic distance information. We replicate prior results using a smaller model, Qwen2.5, assessing its predictive accuracy using a benchmark dataset from Open Targets. Performance improved when integrating RAG-based literature retrieval (F1 = 0.795) and gene distance information (F1 = 0.806). However, the combined approach yielded diminishing returns, suggesting interactions between these enhancements. Error analysis revealed that genomic distance features improved predictions by reinforcing established heuristics, while RAG enhanced domain knowledge but occasionally led to semantic biases. These findings highlight the potential of hybrid approaches in leveraging both structured genomic features and unstructured textual data.

3
Benchmark of biomarker identification and prognostic modeling methods on diverse censored data

Fletcher, W. L.; Sinha, S.

2026-04-01 bioinformatics 10.64898/2026.03.29.715113 medRxiv
Top 0.1%
4.9%

The practice of identifying biomarkers and developing prognostic models using genomic data has become increasingly prevalent. Such data often feature characteristics that make these practices difficult, namely high dimensionality, correlations between predictors, and sparsity. Many modern methods have been developed to address these problematic characteristics while performing feature selection and prognostic modeling, but a large-scale comparison of their performance on diverse right-censored time-to-event data (i.e., survival time data) is much needed. We compiled many existing methods, including several machine learning methods that have performed well in previous benchmarks, primarily to compare their variable selection capability, and secondarily their survival time prediction, on many synthetic datasets with varying levels of sparsity, correlation between predictors, and signal strength of informative predictors. For illustration, we also performed multiple analyses using these methods on a publicly available and widely used cancer cohort from The Cancer Genome Atlas. We evaluated the methods through extensive simulation studies in terms of the false discovery rate, F1-score, concordance index, Brier score, root mean square error, and computation time. Of the methods compared, CoxBoost and the adaptive LASSO performed well on all metrics, and the LASSO and elastic net excelled on concordance index and F1-score. The Benjamini-Hochberg and q-value procedures showed volatile performance in controlling the false discovery rate. Some methods' performances were greatly affected by differences in data characteristics. With our extensive numerical study, we have identified the best-performing methods for a wide range of data characteristics using informative metrics. This will help cancer researchers choose the best approach for their needs when working with genomic data.
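The concordance index used as one of this benchmark's metrics can be illustrated with a small pure-Python sketch for right-censored data (Harrell's C; an illustrative implementation, not the one benchmarked in the paper):

```python
def concordance_index(times, events, risks):
    """Harrell's C-index: the fraction of comparable pairs in which the
    higher-risk subject experiences the event earlier.
    events[i] = 1 means an observed event; 0 means right-censored."""
    concordant = tied = comparable = 0
    for i in range(len(times)):
        for j in range(len(times)):
            # A pair is comparable only if subject i has an observed event
            # strictly before subject j's time (censored i gives no ordering).
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1
                elif risks[i] == risks[j]:
                    tied += 1
    return (concordant + 0.5 * tied) / comparable

# Perfectly ranked risk scores yield C = 1.0
c = concordance_index([1, 2, 3, 4], [1, 1, 1, 1], [4.0, 3.0, 2.0, 1.0])
```

A C-index of 0.5 corresponds to random ranking; censored subjects only contribute as the later member of a pair.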

4
dgiLIT: A Method for Prioritization and AI Curation of Drug-Gene Interactions

Cannon, M. J.; Bratulin, A.; Stevenson, J. S.; Perry, K.; Coffman, A.; Kiwala, S.; Schimmelpfennig, L.; Costello, H.; McMichael, J. F.; Griffith, M.; Griffith, O. L.; Wagner, A. H.

2026-01-19 bioinformatics 10.64898/2026.01.16.699733 medRxiv
Top 0.1%
4.8%

IMPORTANCE: The Drug-Gene Interaction Database (DGIdb) has a long history of driving hypothesis generation for biomedical research through the careful curation of drug-gene interaction data from primary and secondary sources with supporting literature. Recent advances in large language model (LLM) and artificial intelligence (AI) technologies have enabled new paradigms for knowledge extraction and biocuration. The accelerating growth of biomedical literature presents a significant challenge for maintaining up-to-date interaction data. With more than 38 million citations indexed in PubMed alone, new strategies must evolve to identify and incorporate new interaction data into DGIdb. OBJECTIVE: Identify new cost-effective AI curation strategies for incorporating new drug-gene interactions into DGIdb. METHODS: We present a methodology that leverages deterministic natural language processing techniques, existing harmonization frameworks, and AI-assisted curation to systematically narrow the literature space and identify new drug-gene interactions from published studies for inclusion in DGIdb. RESULTS: We demonstrate the use of lemmatization to prioritize a set of 100 abstracts containing high amounts of interaction words for downstream AI curation. From our set of abstracts, we were then able to identify 137 drug-gene interactions via an AI curation task, with 121 (88.3%) of these interactions being completely novel to DGIdb. A human expert evaluator reviewed this interaction set and was able to validate 134 of 137 (97.8%) interactions as being valid based on the text provided. CONCLUSION: Taken together, our results highlight a promising, cost-effective method of ingesting new interactions into DGIdb.

5
Biomedical Large Language Models and Prompt Engineering for Causality Assessment of Individual Case Safety Reports in Pharmacovigilance

Heckmann, N. S.; Papoutsi, D. G.; Barbieri, M. A.; Battini, V.; Molgaard, S. N.; Schmidt, S. O.; Melskens, L.; Sessa, M.

2026-02-24 pharmacology and therapeutics 10.64898/2026.02.19.26346467 medRxiv
Top 0.1%
4.4%

Background: Biomedical Large Language Models (LLMs) combined with prompt engineering offer domain-specific reasoning, yet their application to individual-level causality assessment remains unexplored. This study evaluated five combinations of biomedical LLMs, prompting strategies, and causality algorithms by comparing their agreement with two human expert evaluators. Research design and methods: A total of 150 Individual Case Safety Reports (ICSRs) were analyzed: 140 reports from the Food and Drug Administration Adverse Event Reporting System (FAERS), and 10 myocarditis/pericarditis ICSRs from the Vaccine Adverse Event Reporting System (VAERS). Assessments were conducted using the Naranjo and WHO-UMC algorithms. Biomedical LLMs tested included TinyLlama 1.1B, Medicine LLaMA-3 8B, and MedLLaMA v20, combined with Chain-of-Thought (CoT) or Decomposition prompting. Agreement was measured using Gwet's Agreement Coefficient 1 (AC1) and percentage agreement, alongside performance metrics and qualitative error analysis. Results: The Medicine LLaMA-3 8B-Naranjo-CoT combination achieved the highest agreement with human assessors for the final classification of causality (64%). Biomedical LLMs demonstrated low inter-rater agreement on critical items of causality assessment such as identification of listed AEs, temporal plausibility, alternative causes, and objective evidence of AEs. Frequent model failures included irrelevant responses. Conclusions: Biomedical LLMs showed improved performance over general-purpose models previously tested but remain suboptimal for reliable causality assessment of ICSRs.
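Gwet's AC1, the agreement statistic used in this study, can be sketched for two raters over binary ratings in a few lines of pure Python (a formula-level illustration, not the study's analysis code):

```python
def gwet_ac1(r1, r2):
    """Gwet's AC1 for two raters with binary ratings (0/1).
    AC1 = (pa - pe) / (1 - pe), where the chance-agreement term pe
    is based on the mean category prevalence across the two raters."""
    n = len(r1)
    pa = sum(a == b for a, b in zip(r1, r2)) / n      # observed agreement
    pi1 = (sum(r1) / n + sum(r2) / n) / 2             # mean prevalence of category 1
    pe = 2 * pi1 * (1 - pi1)                          # chance agreement (binary case)
    return (pa - pe) / (1 - pe)
```

Unlike Cohen's kappa, AC1 remains stable when one category is very rare, which is why it is often favored for safety-report adjudication.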

6
Reusing Blood Samples from a Hospital-based Cohort to Measure Apixaban Plasma Concentrations

Murray, K. T.; Fabbri, D. V.; Annis, J. S.; Clark, C. R.; Pulley, J. M.; Brittain, E.; Gailani, D.

2026-04-08 pharmacology and therapeutics 10.64898/2026.04.07.26350322 medRxiv
Top 0.1%
4.3%

In the management of atrial fibrillation (AF), the most frequently prescribed oral anticoagulant is apixaban, given at a fixed dose of 5 mg BID. Apixaban is predominantly metabolized by cytochrome P450 3A4 (CYP3A4) and is also a substrate for the drug efflux transporter P-glycoprotein (P-gp). In nearly 300,000 Medicare patients with AF receiving apixaban, we previously showed that concomitant therapy with drugs that inhibit both CYP3A4 and P-gp, specifically amiodarone or diltiazem, significantly increased serious bleeding that caused hospitalization and/or death. We hypothesized that this adverse effect was mediated by an increase in apixaban plasma concentrations caused by concomitant therapy that reduced drug elimination. The Vanderbilt University Medical Center biobank BioVU, which utilizes left-over samples obtained from clinically indicated blood draws that would typically be discarded, contains >353,000 samples linked to de-identified electronic medical records (EMRs), with both DNA and plasma harvested. Of 35 samples drawn from patients taking apixaban 5 mg BID, 5 were identified as drawn from patients concomitantly taking drugs inhibiting both CYP3A4 and P-gp. Using a chromogenic anti-Xa assay, we found that plasma concentrations of apixaban were significantly higher (347 ± 64 ng/mL; mean ± SEM) for patients receiving concomitant CYP3A4/P-gp-inhibiting drugs compared to those not treated with these drugs (166 ± 67 ng/mL; P = 0.025, Mann-Whitney U test). There were no differences between the two patient groups with respect to age, weight, or serum creatinine. The results of this pilot study provide preliminary data to support our hypothesis, and they demonstrate the practicality of obtaining pharmacokinetic data from a large cohort of plasma samples linked to de-identified EMRs. This approach could be used to define the role of apixaban levels in high-risk clinical scenarios and to better understand the relationship between drug levels and bleeding risk.

7
Benchmarking 80 binary phenotypes from the openSNP dataset using deep learning algorithms and polygenic risk score tools

Muneeb, M.; Ascher, D.; Myung, Y.; Feng, S.; Henschel, A.

2026-03-09 bioinformatics 10.64898/2026.03.06.710126 medRxiv
Top 0.1%
4.3%

Genotype-phenotype prediction plays a crucial role in identifying disease-causing single nucleotide polymorphisms and in precision medicine. In this manuscript, we benchmark the performance of various machine/deep learning algorithms and polygenic risk score tools on 80 binary phenotypes extracted from the openSNP dataset. After cleaning and extraction, the genotype data for each phenotype is passed to PLINK for quality control, after which it is transformed separately for each of the considered tools/algorithms. To compute polygenic risk scores, we used the quality control measures for the test data and the genome-wide association studies summary statistic file, along with various combinations of clumping and pruning. For the machine learning algorithms, we used p-value thresholding on the training data to select the single nucleotide polymorphisms, and the resulting data was passed to the algorithm. Our results report the average 5-fold Area Under the Curve (AUC) for 29 machine learning algorithms, 80 deep learning algorithms, and 3 polygenic risk score tools with 675 different clumping and pruning parameter combinations. Machine learning algorithms outperformed for 44 phenotypes, while polygenic risk score tools excelled for 36 phenotypes. The results give us valuable insights into which techniques tend to perform better for certain phenotypes compared to more traditional polygenic risk score tools.

8
A systematic assessment of machine learning for structural variant filtering

Kalra, A.; Paulin, L.; Sedlazeck, F.

2026-01-30 bioinformatics 10.64898/2026.01.27.702059 medRxiv
Top 0.1%
4.0%

Background: Accurate discrimination of true structural variants (SVs) from artifacts in long-read sequencing data remains a critical bottleneck. Numerous machine learning solutions have been proposed, ranging from classical models using engineered features to advanced deep learning and foundation model interpretability methods. However, a systematic comparison of their performance, efficiency, and practical utility is lacking. Results: We conducted a comprehensive benchmark of five machine learning paradigms for SV filtering using standardized Genome in a Bottle (GIAB) data for samples HG002 and HG005. We evaluated classical Random Forest classifiers on 15 genomic features, computer vision models (ResNet/VICReg), diffusion-based anomaly detection, sparse autoencoders (SAEs) on the Evo2-7B foundation model, and multimodal ensembles. A simple Random Forest on interpretable features achieved a peak F1-score of 95.7%, effectively matching all more complex models (ResNet50: 95.9%, Diffusion: 95.8%). This study represents the first application of diffusion-based anomaly detection and sparse autoencoders to structural variant analysis; while diffusion models learned highly discriminative, disentangled representations and SAEs uncovered biologically interpretable features (including atoms that were specific for ALU deletions, chromosome X variants, and insertion events), they did not significantly surpass this classification ceiling. Ensemble methods offered no performance benefit but may have future potential given the orthogonality of vision-based and linear features. Conclusions: Our findings demonstrate that for the established task of germline SV filtering, simpler, interpretable models provide an optimal balance of accuracy, speed, and transparency. This benchmark establishes a pragmatic framework for method selection and argues that increased model complexity must be justified by clear, unmet biological needs rather than marginal predictive gains.

9
From SNPs to Pathways: A genome-wide benchmark of annotation discrepancies and their impact on protein- and pathway-level inference

Queme, B.; Muruganujan, A.; Ebert, D.; Mushayahama, T.; Gauderman, W. J.; Mi, H.

2026-03-24 bioinformatics 10.64898/2026.03.21.713397 medRxiv
Top 0.1%
3.9%

Background: Accurate single-nucleotide polymorphism (SNP) annotation is central to genomic research, yet widely used tools and gene models often yield divergent results. Prior studies have shown such discrepancies in small datasets, but the extent of genome-wide variation and its impact on downstream pathway analysis remain unclear. Results: We conducted a comprehensive comparison of three commonly used SNP annotation tools, ANNOVAR, SnpEff, and VEP, using both Ensembl and RefSeq gene models to evaluate more than 40 million SNPs from the Haplotype Reference Consortium. At the protein level, annotation output differed significantly across tools and gene models (p-adj < 0.001), with discrepancies present in both genic and intergenic regions. RefSeq produced broader annotation coverage, particularly for intergenic SNPs, while Ensembl showed greater internal consistency. SnpEff provided the most complete coverage overall, whereas no single tool or model configuration achieved full annotation recovery of the union reference. Integration across tools and models maximized coverage and reduced annotation loss. In a case study of 204 colorectal cancer-associated SNPs from the FIGI GWAS, pathway enrichment results varied depending on annotation strategy. The fully integrated approach identified all four significant pathways, whereas several single-tool or single-model strategies missed one or more. Conclusion: SNP annotation outcomes are influenced by both the tool and gene model used, and relying on a single approach may result in incomplete coverage. A multi-tool, multi-model strategy provides the most comprehensive annotation and preserves enriched pathways, supporting more robust and reproducible genomic interpretation.

10
Predicting long-term adverse outcomes after neonatal intensive care

Ogretir, M.; Kaipainen, V.; Leskinen, M.; Lahdesmaki, H.; Koskinen, M.

2026-03-31 pediatrics 10.64898/2026.03.26.26348580 medRxiv
Top 0.1%
3.6%

Neonates requiring intensive care are at increased risk for long-term neuropsychiatric disorders. However, clinical adoption of risk prediction models remains limited when their performance lacks adequate interpretability for informed clinical decision-making. Here, we investigated whether longitudinal neonatal electronic health record (EHR) data from the first 90 days of life can support clinically meaningful interpretation of long-term risk signals for major neuropsychiatric diagnoses by age seven. In a retrospective register-based cohort of 17,655 at-risk children from an academic medical center, of whom 8.0% (1,420) received a major neuropsychiatric diagnosis during follow-up, we applied a time-aware transformer model (Self-supervised Transformer for Time-Series; STraTS) and thoroughly evaluated its predictions using three complementary interpretability approaches: perturbation-based variable importance, value-dependent effect analysis, and leave-one-out (LOO) feature attribution. STraTS achieved the highest area under the precision-recall curve (AUPRC 0.171 ± 0.022), compared with Random Forest (0.166 ± 0.008), logistic regression (0.151 ± 0.007), and XGBoost (0.128 ± 0.010). Across interpretability methods, five predictors were consistently identified: birth weight, gender, Apgar score at 1 minute, umbilical serum thyroid stimulating hormone (uS-TSH), and treatment time in hospital. Indicators of early clinical severity, including chromosomal abnormalities and neonatal cerebral-status disturbances, showed the largest risk-increasing effects. Furthermore, the model's learned vector representations of subject-specific EHR sequences formed clinically coherent latent embeddings that reflect population heterogeneity along established perinatal risk dimensions.
These findings demonstrate that combining multiple complementary interpretability methods yields stable, clinically plausible risk signals while revealing limitations that would remain undetected by any single approach, highlighting the importance of careful interpretability analysis of deep learning-based risk predictions.
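The AUPRC figures reported above are commonly computed as average precision; a minimal pure-Python sketch (illustrative, not the study's evaluation code):

```python
def average_precision(labels, scores):
    """Average precision: precision at each true-positive rank, averaged
    over all positives (a standard estimator of the area under the
    precision-recall curve)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp, ap = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            tp += 1
            ap += tp / rank      # precision at this recall step
    return ap / sum(labels)
```

For rare outcomes like the 8.0% diagnosis rate here, the AUPRC baseline is the positive prevalence (about 0.08), not 0.5 as for ROC AUC, which is why values such as 0.171 still represent a meaningful lift.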

11
Comparing optimal transport and machine learning approaches for database merging in scenarios involving missing data in covariates. Application to Medical Research

N'Kam Suguem, F.; Dejean, S.; Saint-Pierre, P.; Savy, N.

2026-01-26 bioinformatics 10.64898/2026.01.23.701369 medRxiv
Top 0.1%
3.6%

Motivation: One of the challenges encountered when merging heterogeneous observational clinical datasets is the recoding of categorical target variables that may have been measured differently across data sources. Standard machine learning-based approaches, such as Multiple Imputation by Chained Equations and the k-Nearest Neighbours method, are compared with an Optimal Transport-based algorithm (OTrecod) when databases are altered by missing values in covariates or by imbalanced groups; their empirical performance in these realistic data integration settings remains underexplored. Results: A comprehensive simulation study was conducted, varying sample size, group imbalance, signal-to-noise ratio, and mechanisms of missing data. The results demonstrate that OTrecod consistently achieves higher recoding accuracy compared with Multiple Imputation by Chained Equations and k-Nearest Neighbours, particularly in large, imbalanced and weak-signal scenarios. These findings are further illustrated using subsets of the National Child Development Study, where OTrecod and Multiple Imputation by Chained Equations minimised the distributional divergence between recoded social-class scales, while k-Nearest Neighbours produced less stable results. Availability and Implementation: The source code supporting this study is publicly available at https://github.com/FloAI/CompareOT.

12
A Proof-of-Concept Study of a Clinical Decision Support System for Vancomycin Therapeutic Monitoring

Hassan, F.; Lou, J. Y.; Lim, C. T.; Ong, W. Q.; Rumaizi, N. N.

2026-03-02 pharmacology and therapeutics 10.64898/2026.02.22.26346368 medRxiv
Top 0.1%
3.4%

Artificial intelligence (AI), particularly large language models (LLMs), is increasingly explored in healthcare, yet its real-world usability and safety in high-risk clinical pharmacy tasks remain uncertain. Vancomycin therapeutic drug monitoring (TDM), which requires precise pharmacokinetic calculations and context-sensitive interpretation within a narrow therapeutic window, provides a stringent test case for AI-assisted decision support. This proof-of-concept study developed and evaluated a hybrid clinical decision support system (TDM-AID) integrating a validated deterministic pharmacokinetic calculation engine, GPT-4o-based structured clinical interpretation, and retrieval-augmented guideline support. Thirty retrospective adult vancomycin TDM cases were assessed using a weighted six-domain rubric covering pharmacokinetic accuracy, AUC estimation, prospective prediction, timing recommendations, clinical judgment, and documentation quality. Two independent expert pharmacists evaluated system outputs against benchmark consultations. The overall median performance was 78% (IQR 12%), classified as Acceptable, and 73% (IQR 14%) when deterministic calculations were excluded. Foundational pharmacokinetic calculations achieved 100% accuracy. Clinical judgment demonstrated Good performance (83%), whereas prospective prediction was limited (58%), and timing recommendations were absent in all cases. Safety violations occurred in 17% of cases, including dose recommendations exceeding 4 g/day. Inter-rater reliability was good (ICC 0.87). These findings suggest that hybrid AI-driven decision support is technically feasible and usable as a pharmacist-augmenting draft generator; however, limitations in predictive reasoning, timing logistics, and safety enforcement necessitate deterministic safeguards and mandatory expert oversight before clinical implementation.

13
Early life factors documented in electronic health records predict recurrent acute otitis media

Hurst, J. H.; Zhao, C.; Raynor, E. M.; Lee, J.; Gitomer, S. A.; Woods, C. W.; Kelly, M. S.; Smith, M. J.; Goldstein, B. A.

2026-03-09 pediatrics 10.64898/2026.03.07.26347843 medRxiv
Top 0.1%
3.3%

Background and Objectives: Recurrent acute otitis media (rAOM; defined as ≥3 AOM episodes in 6 months or ≥4 episodes in 12 months) affects 10-15% of children in the United States and is a leading cause of healthcare utilization and antibiotic prescriptions. Prospective identification of children at risk of rAOM could help target interventions and identify new risk factors to guide preventive approaches. We therefore sought to develop predictive models to identify children at risk of rAOM using electronic health record (EHR) data. Methods: We extracted retrospective EHR data for children who were born at Duke University Health System (DUHS) hospitals between January 1, 2014, and June 30, 2022, and who had at least one AOM episode during the study period. We used LASSO to build predictive models for development of rAOM at each episode and identified factors associated with rAOM. Results: We identified 6,566 children who met the study criteria, including 1,634 (24.8%) who met criteria for rAOM. A model using only data available at the first AOM episode had an area under the curve (AUC) of 0.75 (0.73, 0.77) and an area under the precision-recall curve (AUPRC) of 0.41 (95% CI 0.37, 0.46), indicating moderate discriminative ability. At the time of the first AOM episode, features associated with subsequent rAOM development included age, number of prior antibiotic prescriptions, and diagnosis of gastroesophageal reflux disease (GERD). Further, children who developed rAOM were more likely to experience treatment failure than children who did not meet rAOM criteria across all episodes. Conclusions: Our findings indicate that clinical exposures and patient characteristics documented in the EHR distinguish children who are at risk of developing rAOM. Such models could be deployed within EHR systems to identify children who would benefit from early evaluation by an otolaryngologist and audiologist.

14
Benchmarking Heritability Estimation Strategies Across 86 Configurations and Their Downstream Effect on Polygenic Risk Score Performance

Muneeb, M.; Ascher, D.

2026-04-02 bioinformatics 10.64898/2026.04.02.716079 medRxiv
Top 0.1%
3.2%

Objective: SNP heritability estimates vary substantially across estimation strategies, yet the downstream consequences for polygenic risk score (PRS) construction remain poorly characterised. We systematically benchmarked heritability estimation configurations and assessed their propagation into downstream PRS performance. Methods: We benchmarked 86 heritability-estimation configurations spanning six tool families (GEMMA, GCTA, LDAK, DPR, LDSC, SumHer) and ten method groups across 10 UK Biobank phenotypes, yielding 844 configuration-level estimates. Each estimate was propagated into GCTA-SBLUP and LDpred2-lassosum2 PRS frameworks and evaluated across five cross-validation folds using null, PRS-only, and full models. Eleven binary analytical contrasts were tested using Mann-Whitney U tests to identify drivers of heritability variability. Results: Heritability ranged from -0.862 to 2.735 (mean = 0.134, SD = 0.284), with 133 of 844 estimates (15.8%) negative and concentrated in unconstrained estimation regimes. Ten of eleven analytical contrasts significantly affected heritability magnitude, with algorithm choice and GRM standardisation showing the largest effects. Despite this upstream variability, downstream PRS test performance was only weakly coupled to heritability magnitude: pooled Pearson correlations between h2 and test AUC were r = -0.023 for GCTA-SBLUP and r = +0.014 for LDpred2-lassosum2 (both non-significant). Conclusion: SNP heritability is best interpreted as a configuration-sensitive modelling parameter rather than a universally stable scalar input. Heritability estimates should always be reported alongside their full estimation specification, and downstream PRS performance is comparatively robust to moderate variation in the heritability input.

15
Development and validation of an XGBoost model with SHAP-based interpretability and a web-based calculator for predicting extrauterine growth restriction in preterm infants

Xu, Z.; Yu, C.-L.; Zhang, J.-X.

2026-04-02 pediatrics 10.64898/2026.04.01.26349838 medRxiv
Top 0.1%
3.1%

Background: Extrauterine growth restriction (EUGR) is a common and clinically significant complication among preterm infants, contributing to adverse neurodevelopmental and metabolic outcomes. Early and individualized risk prediction remains challenging. This study aimed to develop and validate an interpretable machine learning model for early prediction of EUGR using routinely available clinical variables, and to implement a user-friendly web-based calculator for clinical use. Methods: We retrospectively analyzed 1,431 preterm infants admitted within 24 hours after birth to our hospital between May 2020 and March 2025. Infants from the Yangpu campus (n=863) formed the training set, and those from the Huangpu campus (n=568) formed the validation set. Early clinical variables available within 48-72 hours were screened using the Boruta algorithm. Logistic regression, XGBoost, random forest, decision tree, and support vector machine models were developed and compared. Model performance was evaluated using area under the curve (AUC), accuracy, sensitivity, specificity, F1 score, and Brier score. SHapley Additive exPlanations (SHAP) were applied to assess global and individual feature contributions, nonlinear effects, and interactions. A web-based calculator was constructed based on the optimal model. Results: Nine variables were identified as important predictors: birth weight, small for gestational age status, gestational age, breastfeeding, multiple gestation, neonatal respiratory distress syndrome, patent ductus arteriosus, maternal hypertension, and maternal group B Streptococcus infection. Among the five models, XGBoost achieved the best performance in the validation set (AUC 0.922, accuracy 0.849, Brier score 0.108). SHAP analysis showed that low birth weight, small for gestational age, maternal group B Streptococcus infection, and patent ductus arteriosus were major risk factors, while breastfeeding was protective. 
Notable nonlinear and interactive effects were observed, particularly between birth weight and gestational age and between breastfeeding and patent ductus arteriosus. The web-based calculator provides real-time individualized risk estimation and visualized interpretation. Conclusions: An interpretable XGBoost-based model and web calculator were successfully developed and validated for early prediction of EUGR in preterm infants. This tool may support clinicians in identifying high-risk infants and guiding individualized nutritional and clinical management.

16
Multimodal EHR-Based Prediction of Pediatric Asthma Exacerbations

Fan, Z.; Pan, J.; Lyu, M.; Liang, R.; Sun, C.; Wu, Y.; Fedele, D.; Fishe, J.; Xu, J.

2026-02-27 pediatrics 10.64898/2026.02.25.26347091 medRxiv
Top 0.1%
2.6%

Pediatric asthma exacerbations are a frequent cause of emergency department (ED) visits and hospitalizations, yet accurate risk prediction remains limited and no consensus risk scores exist. Using UF Health electronic health records (EHRs) from 2011-2023, we evaluated two computable phenotypes (i.e., CAPriCORN and COMPAC) to predict exacerbations over 6-, 12-, and 24-month horizons. Exacerbations were defined using a validated composite of diagnosis codes from ED, inpatient, or outpatient encounters combined with systemic corticosteroid prescriptions. Several commonly used machine learning (ML) models were trained with stratified five-fold cross-validation, Bayesian hyperparameter optimization, and Youden's J thresholding. XGBoost achieved the best performance, with SHapley Additive exPlanations (SHAP) highlighting note-derived symptom terms and rescue-medication use as dominant predictors. Future work will focus on external validation and assessment of generalizability. This interpretable, text-integrated framework may support child-specific risk stratification and inform EHR-based decision support for timely pediatric asthma management.
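The Youden's J thresholding mentioned in this abstract picks the classification cutoff that maximizes sensitivity + specificity − 1; a small pure-Python sketch (illustrative, not the study's code):

```python
def youden_threshold(labels, scores):
    """Return (threshold, J) maximizing Youden's J = sensitivity + specificity - 1,
    scanning the observed scores as candidate cutoffs (predict positive if
    score >= threshold)."""
    pos = sum(labels)
    neg = len(labels) - pos
    best_t, best_j = None, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= t)
        tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < t)
        j = tp / pos + tn / neg - 1
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j
```

J ranges from 0 (no better than chance) to 1 (perfect separation), so it gives a prevalence-independent operating point on the ROC curve.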

17
On why and how to encode probability distributions on graph representations of omics data: enhancing predictive tasks and knowledge discovery

Goncalves, D. M.; Patricio, A.; Costa, R. S.; Henriques, R.

2026-02-19 bioinformatics 10.64898/2026.02.19.706756 medRxiv
Top 0.2%
2.4%

The growing availability and complexity of omics data have driven the development of specialized algorithms for modeling molecular systems. Although graph-based learning methods effectively represent biological interactions, they often neglect the statistical information embedded in node and edge annotations. To address this limitation, we propose a novel graph-based framework that integrates structured statistical distributions into nodes and edges, capturing probabilistic characteristics of molecular relationships. We evaluate the proposed approach on omics datasets from five cancer types across multiple clinical outcomes, including survivability and primary tumor site. Results demonstrate predictive performance comparable to established machine learning baselines. Beyond prediction, the statistically enriched graph representations enable the identification and characterization of regulatory modules associated with clinical outcomes, enhancing biological interpretability. These findings suggest that incorporating structured statistical information into graph representations provides a competitive and interpretable framework for predictive modeling and knowledge discovery in complex diseases.
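The core idea above, annotating graph elements with probability distributions rather than point values, can be illustrated with a toy graph whose nodes carry Gaussian parameters; a closed-form divergence between endpoint distributions then becomes an edge feature. The gene names and the Gaussian/KL choice below are illustrative assumptions, not the paper's actual encoding:

```python
import math

# Toy graph: each node carries a Gaussian (mean, std) summarizing an omics feature.
nodes = {
    "geneA": (0.0, 1.0),
    "geneB": (0.5, 1.2),
    "geneC": (3.0, 0.8),
}
edges = [("geneA", "geneB"), ("geneB", "geneC")]

def gaussian_kl(p, q):
    """Closed-form KL(N(mu_p, sd_p^2) || N(mu_q, sd_q^2))."""
    mu_p, sd_p = p
    mu_q, sd_q = q
    return (math.log(sd_q / sd_p)
            + (sd_p ** 2 + (mu_p - mu_q) ** 2) / (2 * sd_q ** 2)
            - 0.5)

# Annotate each edge with the divergence between its endpoint distributions,
# giving downstream models a statistical, not just topological, signal.
edge_features = {(u, v): gaussian_kl(nodes[u], nodes[v]) for u, v in edges}
```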

18
Detecting Manuscripts Related to Computable Phenotypes Using a Transformer-based Language Model

Chae, J.; Heise, D. A.; Connatser, K.; Honerlaw, J.; Maripuri, M.; Ho, Y.-L.; Fontin, F.; Tanukonda, V.; Cho, K.

2026-03-16 bioinformatics 10.64898/2026.03.12.711165 medRxiv
Top 0.2%
2.3%

Objective: The demand for a comprehensive phenomics library, which requires identifying computable phenotype definitions and associated metadata from an ever-expanding biomedical literature, presents a significant, labor-intensive, and unscalable challenge. To address this, we introduce a transformer-based language model specifically designed for identifying biomedical texts containing computable phenotypes and piloted its use in the Centralized Interactive Phenomics Resource (CIPHER) platform. Materials and Methods: We fine-tuned a BioBERT model using a labeled dataset of 396 manuscripts. The model incorporates our novel sliding-window approach to effectively overcome token-length limitations, thereby enabling accurate classification of full-length manuscripts. For scalable deployment and continuous refinement, we developed a cohesive framework that integrates a web-based user interface, a control server, and a classification module. Results: The staged approach for model development yielded a final model with 95% accuracy. The web-based user interface was deployed on the CIPHER platform and enables user feedback for model retraining. Discussion: We developed a model and user interface which are currently in use by data curators to identify computable phenotype definitions from the literature. Conclusion: Through this system, users can submit literature, assess classification results, and provide feedback directly influencing future model training, thereby offering an efficient and adaptive solution for accelerating phenotype-driven literature curation.
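The sliding-window approach described above splits a manuscript that exceeds the model's token limit into overlapping chunks that are classified separately, then aggregates the per-window scores. A minimal sketch of the windowing and one simple aggregation rule; the window size, stride, and max-score aggregation are assumptions for illustration, not the paper's reported settings:

```python
def sliding_windows(tokens, window=512, stride=256):
    """Split a long token sequence into overlapping fixed-size windows so a
    length-limited model sees every token at least once."""
    if len(tokens) <= window:
        return [tokens]
    out = []
    start = 0
    while start < len(tokens):
        out.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # last window already reaches the end of the document
        start += stride
    return out

def aggregate(window_scores):
    """Document-level score: one confident window is enough to flag the paper."""
    return max(window_scores)
```

A 1,000-token manuscript with these defaults yields three overlapping windows, and the document score is the best window score.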

19
ExposoGraph: An Interactive Platform for Carcinogen Bioactivation and Detoxification Pathway Visualization

Pienta, K.; Kazi, J. U.

2026-03-24 bioinformatics 10.64898/2026.03.22.713456 medRxiv
Top 0.2%
2.1%

Background: Despite extensive cataloging of carcinogenic exposures by the International Agency for Research on Cancer (IARC) and pharmacogenomic variation by resources such as PharmVar and CPIC, few platforms unify exposure, metabolic activation and detoxification, DNA damage, and genetic annotation within a single interactive visualization framework. This gap limits systematic evaluation of gene-environment interactions in cancer risk assessment. Methods: We developed the Carcino-Genomic Knowledge Graph, ExposoGraph, an interactive knowledge-graph platform for carcinogen metabolism and DNA damage pathways. The reference graph integrates curated data and annotations from IARC, KEGG, PharmVar, CPIC, CTD, and supporting literature/resources. The current reference graph contains 96 nodes across 5 entity types (Carcinogens, Enzymes, Metabolites, DNA Adducts, and Pathways) and 102 edges across 6 relationship types (activates, detoxifies, transports, forms adduct, repairs, and pathway). Results: The first-generation reference graph captures metabolic activation and detoxification pathways for 9 carcinogen classes spanning 15 index carcinogens. It represents 36 enzymes across Phase I activation (n=14), Phase II conjugation and detoxification (n=14), Phase III transport (n=3), and DNA repair (n=5). Interactive exploration supports carcinogen-class filtering, node- and edge-type filtering, metadata-based search, and detailed hover/detail views with provenance and pharmacogenomic annotations. The androgen branch highlights cross-pathway connectivity by linking androgen metabolism to estrogen quinone formation and DNA adduct generation through CYP19A1-mediated aromatization and downstream catechol estrogen chemistry. In the optional androgen-focused extension, additional receptor, tissue, and variant context further connects this branch to androgen receptor signaling and genotype-specific annotations.
Conclusions: ExposoGraph provides a first-generation integrated, interactive framework linking carcinogenic exposures to metabolic fates and genetic modulators. The platform supports hypothesis generation for gene-environment interaction studies and may inform future individualized risk modeling, while remaining a research-use framework rather than a clinically validated risk-assessment tool.
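A typed knowledge graph like the one described, with nodes tagged by entity type and edges by relationship type, directly supports the node- and edge-type filtering the platform offers. A toy sketch of that structure; the specific entries below are textbook benzo[a]pyrene chemistry used as placeholders, not ExposoGraph's curated content:

```python
# Illustrative typed graph using the entity/relationship taxonomy described above.
nodes = {
    "BaP": {"type": "Carcinogen"},
    "CYP1A1": {"type": "Enzyme"},
    "BPDE": {"type": "Metabolite"},
    "GSTM1": {"type": "Enzyme"},
    "BPDE-N2-dG": {"type": "DNA Adduct"},
}
edges = [
    ("CYP1A1", "activates", "BaP"),
    ("BPDE", "forms adduct", "BPDE-N2-dG"),
    ("GSTM1", "detoxifies", "BPDE"),
]

def filter_edges(edges, nodes, edge_type=None, node_type=None):
    """Return edges matching a relationship type and/or having at least one
    endpoint of the given entity type."""
    keep = []
    for u, rel, v in edges:
        if edge_type and rel != edge_type:
            continue
        if node_type and nodes[u]["type"] != node_type and nodes[v]["type"] != node_type:
            continue
        keep.append((u, rel, v))
    return keep
```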

20
Development and external validation of the NEO-READY model to predict date of discharge among premature neonatal intensive care patients

Lonsdale, H.; Patel, K. B.; Domenico, H.; Moore, R. S.; McCoy, A. B.; French, B.; Rosenbloom, S. T.; Byrne, D. W.; Freundlich, R. E.; Alrifai, M. W.

2026-01-22 pediatrics 10.64898/2026.01.20.26344137 medRxiv
Top 0.2%
2.1%

OBJECTIVE: To develop a parsimonious, interpretable, and accurate model for predicting discharge for premature infants in the NICU that is suitable for prospective evaluation and integration into clinical workflows. STUDY DESIGN: Using routinely available electronic health record data, we developed and validated NEOnatal Reliable Estimation of Approaching Discharge in Young infants (NEO-READY), a daily-updating model that predicts likelihood of discharge within 5 days for premature infants. RESULTS: Data from 702 infants were used to develop the model, and data from 201 infants were used for temporal external validation. The model includes 13 predictors and two interaction terms and demonstrated excellent discrimination across development (AUC = 0.88, 95% CI 0.87-0.90) and validation (0.90, 0.88-0.91) cohorts. CONCLUSION: This work represents step 1 toward our long-term goal: integrating the NEO-READY model into clinical workflows as part of a comprehensive strategy to improve discharge preparedness, reduce discharge delays, and optimize NICU resources.
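The AUC used above to report discrimination equals the concordance probability: the chance that a randomly chosen infant who was discharged within 5 days receives a higher model score than a randomly chosen infant who was not. A minimal sketch of that pairwise definition (fine for small examples; real toolkits compute it from the ROC curve for efficiency):

```python
def auc(scores, labels):
    """AUC as the probability a random positive outranks a random negative,
    counting ties as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Perfect separation gives 1.0, perfectly inverted scores give 0.0, and indistinguishable scores give 0.5.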